Google's Gemini now "sees" videos and interprets visual and audio content

The new capability allows Gemini to "see" what’s happening in videos, including recognising people, objects, actions, emotions, and spoken dialogue, and then to generate detailed summaries or insights in response to user prompts.
Google has officially announced that its Gemini AI model is now capable of analysing and interpreting video content, marking a significant milestone in the development of multimodal artificial intelligence.
“We’re expanding what’s possible with Gemini by introducing video understanding. This allows users to upload or link to video content and receive rich, contextual responses that reflect a deep understanding of both visual and audio information over time,” Google said in a statement.
The feature positions Gemini alongside competitors like xAI’s Grok and OpenAI’s GPT-4o, both of which have recently pushed into video comprehension as part of the broader race to develop real-time, multimodal AI assistants.
Early demonstrations show Gemini capable of analysing sports clips, offering scene-by-scene breakdowns of films, identifying safety violations in workplace footage, and assisting in educational settings by simplifying complex video-based lessons.
The model can also transcribe dialogue, describe tone and facial expressions, and summarise content with precision.
Tech analysts suggest that this development could reshape industries from media and education to compliance and accessibility, offering enhanced video indexing, auto-captioning, content moderation, and personalised learning.
However, experts also caution that video comprehension raises new ethical and privacy challenges, especially if used at scale.
Critics urge AI companies to build in safeguards that ensure such technologies are not used for mass surveillance or unauthorised content analysis.
Google has not yet confirmed when the new video capabilities will be widely available to the public, but sources indicate that a staged rollout, beginning with developers and Gemini Advanced users, is expected later this year.